Boosting Distributed Machine Learning Training Through Loss-tolerant Transmission Protocol
Distributed Machine Learning (DML) systems are utilized to enhance the speed
of model training in data centers (DCs) and edge nodes. The Parameter Server
(PS) communication architecture is commonly employed, but it faces severe
long-tail latency caused by many-to-one "incast" traffic patterns, negatively
impacting training throughput. To address this challenge, we design the
Loss-tolerant Transmission Protocol (LTP), which
permits partial loss of gradients during synchronization to avoid unneeded
retransmission and contributes to faster synchronization per iteration. LTP
implements loss-tolerant transmission through out-of-order
transmission and out-of-order acknowledgements (ACKs). LTP employs
Early Close to adjust the loss-tolerant threshold based on network
conditions and Bubble Filling for data correction to maintain training
accuracy. LTP is implemented in C++ and integrated into PyTorch. Evaluations on
a testbed of 8 worker nodes and one PS node demonstrate that LTP can
significantly improve DML training task throughput by up to 30x compared to
traditional TCP congestion control, with no sacrifice in final accuracy.
Comment: This paper will be published at IWQoS 2023. Preview version onl
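The loss-tolerant idea in the abstract above can be illustrated with a minimal Python sketch. It is not LTP's implementation: the function name `assemble_gradient`, the chunk layout, the fixed `threshold`, and the zero-filling of missing chunks are all assumptions made for illustration. The abstract does not say how Early Close chooses the threshold or how Bubble Filling corrects data.

```python
from typing import Dict, List, Optional


def assemble_gradient(
    received: Dict[int, List[float]],
    num_chunks: int,
    chunk_len: int,
    threshold: float,
) -> Optional[List[float]]:
    """Build a flat gradient from the chunks that arrived (hypothetical helper).

    received  maps chunk index -> chunk values; chunks may arrive in any order.
    threshold is the minimum fraction of chunks required before closing early.
    Returns None while too few chunks have arrived.
    """
    if len(received) / num_chunks < threshold:
        return None  # keep waiting for more chunks
    flat: List[float] = []
    for index in range(num_chunks):
        # Assumption: a lost chunk is replaced by zeros instead of being
        # retransmitted. The real data-correction scheme may differ.
        flat.extend(received.get(index, [0.0] * chunk_len))
    return flat


# Chunk 1 of 3 was lost; 2/3 of the chunks satisfy a 0.6 threshold.
partial = {2: [5.0, 6.0], 0: [1.0, 2.0]}
gradient = assemble_gradient(partial, num_chunks=3, chunk_len=2, threshold=0.6)
```

With a stricter threshold such as 0.9, the same call returns `None`, which models a receiver that keeps waiting instead of closing the transfer.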
OSP: Boosting Distributed Model Training with 2-stage Synchronization
Distributed deep learning (DDL) is a promising research area that aims to
increase the efficiency of training deep learning tasks with large
datasets and models. As the computation capability of DDL nodes continues to
increase, the network connection between nodes is becoming a major bottleneck.
Various methods of gradient compression and improved model synchronization have
been proposed to address this bottleneck in Parameter-Server-based DDL.
However, gradient compression can lose accuracy because gradients are
discarded, while improved synchronization yields only a limited gain in
throughput. To address these challenges, we propose a new
model synchronization method named Overlapped Synchronization Parallel (OSP),
which achieves efficient communication with a 2-stage synchronization approach
and uses Local-Gradient-based Parameter correction (LGP) to avoid accuracy loss
caused by stale parameters. The prototype of OSP has been implemented using
PyTorch and evaluated on commonly used deep learning models and datasets with a
9-node testbed. Evaluation results show that OSP can achieve up to 50%
improvement in throughput without accuracy loss compared to popular
synchronization models.
Comment: Copyright Owner/Author | ACM 2023. This is the author's version of
the work. It is posted here for your personal use. Not for redistribution.
The definitive Version of Record will be published in ICPP 202
Study on heat fluxes and their effect on electrode material removal on the copper electrode of a field-distortion gas spark switch
The gas switch is one of the key elements in pulsed-power devices, and electrode erosion is a key limiting factor in the development and application of high-power gas switches. Based on the thermal equilibrium equation near the electrode surface, this paper calculates the electrode heat fluxes and their peak powers under different discharge conditions, and analyses their effect on how electrode material is removed. When the discharge current is not too high, the calculations indicate that arc Joule heating is the main cause of electrode erosion and that solid material removal may occur. When the discharge current is high enough, vaporization of electrode material is the main cause of electrode erosion.
Robustness testing for software components
Component-based development allows one to build software from existing components and promises to improve software reuse and reduce costs. For critical applications, the user of a component must ensure that it fits the requirements of the application. To achieve this, testing is a well-suited means when the source code of the components is not available. Robustness testing is a testing methodology to detect the vulnerabilities of a component under unexpected inputs or in a stressful environment. As components may fail differently in different states, we use a state machine based approach to robustness testing. First, a set of paths is generated to cover transitions of the state machine, and it is used by the test cases to bring the component into a specific control state. Second, method calls with invalid inputs are fed to the component in different states to test the robustness. By traversing the paths, the test cases cover more states and transitions compared to stateless API testing. We apply our approach to several components, including open source software, and compare our results with existing approaches.
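The two steps described in the abstract above can be sketched in Python as follows. The state machine encoding (a dict from state to `{method: next_state}`), the function names, and the toy file-like component are hypothetical. The abstract does not give the path-generation algorithm, so breadth-first search stands in for it here.

```python
from collections import deque


def transition_cover(machine, start):
    """Return (reach, paths) for a state machine given as nested dicts.

    reach maps each reachable state to a shortest method sequence leading to it.
    paths holds one method sequence per transition, ending with that transition.
    """
    reach = {start: []}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        for method, nxt in machine.get(state, {}).items():
            if nxt not in reach:
                reach[nxt] = reach[state] + [method]
                queue.append(nxt)
    paths = [reach[s] + [m] for s in reach for m in machine.get(s, {})]
    return reach, paths


def robustness_cases(reach, methods, invalid_inputs):
    """Pair every reachable state with every method and every invalid input."""
    return [
        (state, prefix, method, bad)
        for state, prefix in reach.items()
        for method in methods
        for bad in invalid_inputs
    ]


# Hypothetical component with two control states.
machine = {
    "closed": {"open": "opened"},
    "opened": {"read": "opened", "close": "closed"},
}
reach, paths = transition_cover(machine, "closed")
cases = robustness_cases(reach, ["open", "read", "close"], [None, -1])
```

Each case says: replay `prefix` to bring the component into `state`, then call `method` with the invalid input `bad`. A stateless API test would exercise only the cases whose prefix is empty.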
Failure times prediction of field-distortion gas switch based on electrode surface roughness
Prediction of switch failure times has a great influence on determining the repair cycle of gas switches and pulsed-power systems, preventing accidents, and reducing cost. In this paper, electrode surface roughness (ESR) is proposed to analyze switch performance and predict switch failure times. Based on the one-dimensional heat conduction equation and the thermal equilibrium equation near the electrode surface, the etch pit depth can be calculated for different discharge conditions. The electrode surface roughness is then obtained by calculating the depth of the deepest etch pit and the burr peak height in the electrode erosion region for the same discharge condition. The switch failure times can be predicted from the trend of the ESR with the number of discharges. Experimental results indicate that the calculation model can predict the switch failure times effectively.
Machine Learning Aided Prediction of Glass-Forming Ability of Metallic Glass
Predicting the glass-forming ability (GFA) of metallic glasses (MGs) can accelerate their development. In this paper, a dataset was constructed from experimental data collected from the literature and books, and a machine learning-based model was established to predict the GFA. First, a classification model based on the critical diameter (Dmax) was established to determine whether an alloy system could form a glass state, with an accuracy of 0.98. Then, regression models were established to predict the crystallization temperature (Tx), glass transition temperature (Tg), and liquidus temperature (Tl) of MGs. The R2 of the prediction models on the test set was greater than 0.89, which showed good prediction accuracy. The features used by the regression models were screened using variance, correlation, embedding, recursive, and exhaustive methods to select the most important ones. Furthermore, to improve the interpretability of the prediction model, feature importance, partial dependence plot (PDP), and individual conditional expectation (ICE) methods were used for visualization analysis, demonstrating how features affect the target variables. Finally, taking Zr-Cu-Ni-Al system MGs as an example, a genetic algorithm was applied to the prediction model to search the compositional space for high GFA, achieving the optimal design of alloy composition.
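For readers unfamiliar with the R2 metric quoted above, the following Python sketch computes the coefficient of determination from its standard definition. The function name `r_squared` and the sample temperatures are made up for illustration and are not taken from the paper's dataset.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical measured and predicted temperatures (K), not data from the paper.
measured = [700.0, 720.0, 760.0, 800.0]
predicted = [705.0, 715.0, 770.0, 790.0]
score = r_squared(measured, predicted)
```

A value of 1 means the predictions match the measurements exactly, and a value of 0 means the model does no better than always predicting the mean of the measurements.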
Prediction of the Fatigue Strength of Steel Based on Interpretable Machine Learning
Most failures in steel materials are due to fatigue damage, so it is of great significance to analyze the key features of fatigue strength (FS) in order to improve fatigue performance. This study collected data on the fatigue strength of steel materials and established a predictive model for FS based on machine learning (ML). Three feature-construction strategies were proposed based on the dataset and compared on four typical ML algorithms. The combination of Strategy Ⅲ (composition, heat-treatment, and atomic features) and the GBT algorithm showed the best performance. Subsequently, input features were selected step by step using the analysis of variance (ANOVA), embedded, recursive, and exhaustive methods. The key features affecting FS were found to be TT, mE, APID, and Mo. Based on these key features and Bayesian optimization, an ML model was established, which showed good performance. Finally, Shapley additive explanations (SHAP) and symbolic regression (SR) were introduced to improve the interpretability of the prediction model. SHAP analysis showed that TT and Mo had the most significant impact on FS. Specifically, it was observed that 160 0.15 was beneficial for increasing the value of FS. SR was used to establish a significant mathematical relationship between these key features and FS.
Optimal Design of the Austenitic Stainless-Steel Composition Based on Machine Learning and Genetic Algorithm
As the fourth paradigm of materials research and development, the materials genome paradigm can significantly improve the efficiency of research and development for austenitic stainless steel. In this study, experimental data on austenitic stainless steel were collected, and its chemical composition was optimized by machine learning and a genetic algorithm, so that the production cost is reduced and the research and development of new steel grades is accelerated without reducing the mechanical properties. Specifically, four machine learning prediction models were established for different mechanical properties, with the gradient boosting regression (GBR) algorithm demonstrating superior prediction accuracy compared to other commonly used machine learning algorithms. Bayesian optimization was then employed to identify the optimal combination of hyperparameters for the GBR algorithm. The resulting mechanical property prediction models had good prediction accuracy on the test set (yield strength: R2 = 0.88, MAE = 4.89 MPa; ultimate tensile strength: R2 = 0.99, MAE = 2.65 MPa; elongation: R2 = 0.84, MAE = 1.42%; reduction in area: R2 = 0.88, MAE = 1.39%). Moreover, feature importance and Shapley Additive Explanation (SHAP) values were used to analyze the interpretability of the prediction models and to assess how the features influence the overall performance. Finally, the NSGA-III algorithm was used to simultaneously maximize the predicted mechanical properties within the search space, thereby obtaining the corresponding non-dominated solution set of chemical compositions and achieving the optimization of austenitic stainless-steel compositions.
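The "non-dominated solution set" mentioned above can be made concrete with a short Python sketch. This is not NSGA-III itself, only the dominance filter that defines such a set when every objective is maximized. The function names and the candidate tuples (imagined as predicted yield strength in MPa and elongation in %) are hypothetical.

```python
def dominates(a, b):
    """True if a is at least as good as b on every objective and better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def non_dominated(points):
    """Keep the points that no other point dominates (all objectives maximized)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical (yield strength, elongation) predictions for five compositions.
candidates = [(250, 55), (280, 50), (250, 50), (300, 40), (280, 45)]
front = non_dominated(candidates)
```

Here `(250, 50)` and `(280, 45)` are dropped because another candidate is at least as good on both objectives, while the remaining three represent different trade-offs between strength and ductility.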